Toxic Inventory Sites that are within 500 meters of a Water Source

Table of Contents

In [1]:
%matplotlib inline

import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
from pylab import rcParams
In [2]:
# change default figsize 
plt.rcParams['figure.figsize'] = (15, 12)

Read the rivers & state data files

Rivers/Streams Data Source: The National Map Small-Scale website
This workbook requires the shapefiles found here.

In [3]:
%%time
rivers_raw = gpd.read_file('data/streaml010g/streaml010g.shp')
CPU times: user 1min 8s, sys: 5.95 s, total: 1min 14s
Wall time: 1min 15s
In [4]:
%%time
us_states = gpd.read_file('http://www2.census.gov/geo/tiger/GENZ2017/shp/cb_2017_us_state_20m.zip')
CPU times: user 42.3 ms, sys: 38.7 ms, total: 81 ms
Wall time: 774 ms
In [5]:
print('Rivers CRS =', rivers_raw.crs)
print('US States CRS =', us_states.crs)
Rivers CRS = {'init': 'epsg:4269'}
US States CRS = {'init': 'epsg:4269'}

Examine the dataframes

In [6]:
rivers_raw.head()
Out[6]:
Stream_ID Name Feature F_Code State State_FIPS Region ReachCode ComID From_node ... Strahler Length_mi GM_f_code GM_hyc GM_lit GM_nam GM_soc GM_exs GM_loc geometry
0 10021309 None Artificial Path 23 CT 09 1 01080205000744 7701354 0 ... 1 1.168061 BH140 8 2 UNK USA -999 -999 LINESTRING Z (-72.49288513799996 41.6033134060...
1 10021310 None Artificial Path 23 MA 25 1 01090004003114 6126641 2 ... 1 1.529835 BH140 8 2 UNK USA -999 -999 LINESTRING Z (-70.93963790199996 41.8355874390...
2 10021311 Nepaug River Artificial Path 23 CT 09 1 01080207000985 6109625 4 ... 1 0.630356 BH140 8 2 Nepaug River USA -999 -999 LINESTRING Z (-72.94474237599997 41.8183887120...
3 10021312 Nepaug River Artificial Path 23 CT 09 1 01080207000986 6109627 6 ... 1 0.465079 BH140 8 2 Nepaug River USA -999 -999 LINESTRING Z (-72.95266729499997 41.8162338130...
4 10021313 None Artificial Path 23 MA 25 1 01090002001517 5879019 7 ... 1 1.016785 BH140 8 2 UNK USA -999 -999 LINESTRING Z (-70.85520350199994 41.7691082180...

5 rows × 22 columns

In [7]:
us_states.head()
Out[7]:
STATEFP STATENS AFFGEOID GEOID STUSPS NAME LSAD ALAND AWATER geometry
0 02 01785533 0400000US02 02 AK Alaska 00 1478588231566 277723861311 (POLYGON ((-173.074642 60.704657, -172.912636 ...
1 06 01779778 0400000US06 06 CA California 00 403483182192 20484637928 (POLYGON ((-118.593969 33.467198, -118.484785 ...
2 08 01779779 0400000US08 08 CO Colorado 00 268425964573 1178495763 POLYGON ((-109.059962 38.499987, -109.05996197...
3 11 01702382 0400000US11 11 DC District of Columbia 00 158351639 18675956 POLYGON ((-77.119759 38.934343, -77.0410179999...
4 16 01779783 0400000US16 16 ID Idaho 00 214048160737 2393355752 POLYGON ((-117.243027 44.390974, -117.215072 4...

Examine dataframe columns - using grouby() and stack()

In [8]:
rivers_raw.groupby('Feature').agg(['count', 'size', 'nunique']).stack()
Out[8]:
Stream_ID Name F_Code State State_FIPS Region ReachCode ComID From_node To_node Enabled Strahler Length_mi GM_f_code GM_hyc GM_lit GM_nam GM_soc GM_exs GM_loc
Feature
Aqueduct count 75 30 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75
size 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75
nunique 75 16 1 16 16 12 69 74 75 75 2 5 75 1 1 1 1 1 1 1
Artificial Path count 45578 29718 45578 45578 45578 45578 45578 45578 45578 45578 45578 45578 45578 45578 45578 45578 45578 45578 45578 45578
size 45578 45578 45578 45578 45578 45578 45578 45578 45578 45578 45578 45578 45578 45578 45578 45578 45578 45578 45578 45578
nunique 45578 2903 1 123 123 20 41210 45233 45550 42022 2 8 45578 1 1 1 2904 1 1 1
Canal count 8898 4082 8898 8898 8898 8898 8898 8898 8898 8898 8898 8898 8898 8898 8898 8898 8898 8898 8898 8898
size 8898 8898 8898 8898 8898 8898 8898 8898 8898 8898 8898 8898 8898 8898 8898 8898 8898 8898 8898 8898
nunique 8898 603 1 60 60 20 8313 8888 8762 8674 2 8 8898 1 1 1 1 1 1 1
Intracoastal Waterway count 1704 1703 1704 1704 1704 1704 1704 1704 1704 1704 1704 1704 1704 1704 1704 1704 1704 1704 1704 1704
size 1704 1704 1704 1704 1704 1704 1704 1704 1704 1704 1704 1704 1704 1704 1704 1704 1704 1704 1704 1704
nunique 1704 68 1 17 17 4 1042 1533 1691 1694 2 6 1704 1 1 1 1 1 1 1
Stream count 885644 681408 885644 885644 885644 885644 885644 885644 885644 885644 885644 885644 885644 885644 885644 885644 885644 885644 885644 885644
size 885644 885644 885644 885644 885644 885644 885644 885644 885644 885644 885644 885644 885644 885644 885644 885644 885644 885644 885644 885644
nunique 885644 19505 1 164 164 21 722964 885101 885534 854415 2 8 885644 1 1 1 19374 1 1 1
Stream Intermittent count 153129 111191 153129 153129 153129 153129 153129 153129 153129 153129 153129 153129 153129 153129 153129 153129 153129 153129 153129 153129
size 153129 153129 153129 153129 153129 153129 153129 153129 153129 153129 153129 153129 153129 153129 153129 153129 153129 153129 153129 153129
nunique 153129 10449 1 113 113 21 137018 153105 153127 149568 1 7 153129 1 1 1 10400 1 1 1

The most incomplete data seems to be the Name of the stream/river.

check the line width

In [9]:
plt.rcParams['lines.linewidth']
Out[9]:
1.5

1.5 is a little too big for all the rivers. Alter linewidth with a new variable called line_width before plotting.

Plot the dataframes

In [10]:
fig, ax = plt.subplots()

base_color = '#f0f0f0'
border_color = base_color
line_color = '#448ee4'
line_width = 0.3

us_states.plot(ax=ax, edgecolor='gray', alpha=0.5, color=base_color)
rivers_raw.plot(ax=ax, color=line_color, linewidth=line_width, zorder=1);
#
ax.set(xlim=(-126,-65), ylim=(24,50));

Thats alot of water! Does zooming in help?

In [11]:
fig, ax = plt.subplots()

base_color = '#f0f0f0'
border_color = base_color
line_color = '#1c68c0'
line_width = 0.5

us_states.plot(ax=ax, edgecolor='gray', alpha=0.7, color=base_color)
rivers_raw.plot(ax=ax, color=line_color, linewidth=line_width, zorder=1);

ax.set(xlim=(-90,-81.8), ylim=(36,39.5));

Information about dataframe columns

In [12]:
print('\n\nThere are',rivers_raw['Feature'].nunique(),'unique Features in the dataframe.')
print('And',rivers_raw['F_Code'].nunique(),'unique F_Code in the dataframe.')
print('\nCounts of each Feature')
print('----------------------')
rivers_raw['Feature'].value_counts()

There are 6 unique Features in the dataframe.
And 6 unique F_Code in the dataframe.

Counts of each Feature
----------------------
Out[12]:
Stream                   885644
Stream Intermittent      153129
Artificial Path           45578
Canal                      8898
Intracoastal Waterway      1704
Aqueduct                     75
Name: Feature, dtype: int64

Unique Feature values with corrsponding F_Code

In [13]:
unique_fcode = rivers_raw['Feature'].unique()
for index, code in enumerate(unique_fcode):
    print(unique_fcode[index], rivers_raw['F_Code'].loc[rivers_raw['Feature'] == unique_fcode[index]].unique())
Artificial Path [23]
Stream [4]
Stream Intermittent [14]
Canal [5]
Aqueduct [6]
Intracoastal Waterway [8]

State column in the rivers dataframe contains more than expected

this is due to the rivers being shared with other states

In [14]:
print(sorted(rivers_raw['State'].unique())) # as the output below shows the rivers can be shared by states
['AK', 'AL', 'AL-FL', 'AL-FL-GA', 'AL-GA', 'AL-LA', 'AL-MS', 'AL-MS-TN', 'AL-TN', 'AR', 'AR-LA', 'AR-MO', 'AR-MO-TN', 'AR-MS', 'AR-OK', 'AR-OK-TX', 'AR-TN', 'AR-TX', 'AZ', 'AZ-CA', 'AZ-CA-NV', 'AZ-NM', 'AZ-NV', 'AZ-UT', 'CA', 'CA-NV', 'CA-OR', 'CO', 'CO-KS', 'CO-NE', 'CO-NM', 'CO-OK', 'CO-UT', 'CO-WY', 'CT', 'CT-MA', 'CT-NY', 'CT-RI', 'DC', 'DC-MD', 'DC-MD-VA', 'DC-VA', 'DE', 'DE-MD', 'DE-PA', 'FL', 'FL-GA', 'GA', 'GA-NC', 'GA-NC-SC', 'GA-SC', 'GA-TN', 'HI', 'IA', 'IA-IL', 'IA-IL-MO', 'IA-IL-WI', 'IA-MN', 'IA-MN-WI', 'IA-MO', 'IA-MO-NE', 'IA-NE', 'IA-NE-SD', 'IA-SD', 'IA-WI', 'ID', 'ID-MT', 'ID-NV', 'ID-OR', 'ID-UT', 'ID-WA', 'ID-WY', 'IL', 'IL-IN', 'IL-IN-KY', 'IL-KY', 'IL-MO', 'IL-WI', 'IN', 'IN-KY', 'IN-KY-OH', 'IN-MI', 'IN-OH', 'KS', 'KS-MO', 'KS-MO-NE', 'KS-NE', 'KS-OK', 'KY', 'KY-MO', 'KY-MO-TN', 'KY-OH', 'KY-OH-WV', 'KY-TN', 'KY-VA', 'KY-VA-WV', 'KY-WV', 'LA', 'LA-MS', 'LA-TX', 'MA', 'MA-NH', 'MA-NH-VT', 'MA-NY', 'MA-RI', 'MA-VT', 'MD', 'MD-PA', 'MD-VA', 'MD-VA-WV', 'MD-WV', 'ME', 'ME-NH', 'MI', 'MI-OH', 'MI-WI', 'MN', 'MN-ND', 'MN-ND-SD', 'MN-SD', 'MN-WI', 'MO', 'MO-NE', 'MO-OK', 'MO-TN', 'MS', 'MS-TN', 'MT', 'MT-ND', 'MT-ND-SD', 'MT-SD', 'MT-SD-WY', 'MT-WY', 'NC', 'NC-SC', 'NC-TN', 'NC-VA', 'ND', 'ND-SD', 'NE', 'NE-SD', 'NE-WY', 'NH', 'NH-VT', 'NJ', 'NJ-NY', 'NJ-NY-PA', 'NJ-PA', 'NM', 'NM-OK', 'NM-TX', 'NV', 'NV-OR', 'NV-UT', 'NY', 'NY-PA', 'NY-VT', 'OH', 'OH-PA', 'OH-PA-WV', 'OH-WV', 'OK', 'OK-TX', 'OR', 'OR-WA', 'PA', 'PA-WV', 'PR', 'RI', 'SC', 'SD', 'SD-WY', 'TN', 'TN-VA', 'TX', 'UT', 'UT-WY', 'VA', 'VA-WV', 'VT', 'WA', 'WI', 'WV', 'WY']

Focus dataframes on one state

In [15]:
state_df = us_states.loc[us_states['STUSPS'] == 'KY']

Use the contains method with rivers
this gets the "shared" portion of the rivers

In [16]:
rivers_df = rivers_raw.loc[rivers_raw['State'].str.contains('KY')]

Plot the new dataframe

In [17]:
fig, ax = plt.subplots()

base_color = '#f0f0f0'
border_color = base_color
line_color = '#1c68c0'
line_width = 0.7

state_df.plot(ax=ax, edgecolor='gray', alpha=0.7, color=base_color)
rivers_df.plot(ax=ax, color=line_color, linewidth=line_width, zorder=1);

Top 20 longest rivers

In [18]:
rivers_df_length = rivers_df.groupby(['Name'])['Length_mi'].agg('sum')
rivers_df_length.sort_values(ascending=False).nlargest(20)
Out[18]:
Name
Ohio River                    638.910087
Green River                   377.627973
Cumberland River              371.252152
Licking River                 290.802015
Kentucky River                255.585945
North Fork Kentucky River     156.970647
Salt River                    140.683565
Rough River                   139.423881
Barren River                  130.928145
Levisa Fork                   123.755934
Red River                     121.421766
Beech Fork                    119.444316
Tradewater River              114.569528
Rolling Fork                  104.687931
Middle Fork Kentucky River    101.168200
Eagle Creek                   100.718585
Beaver Creek                   97.923336
Nolin River                    97.023088
Tug Fork                       91.754404
Clear Creek                    86.624285
Name: Length_mi, dtype: float64

Simplfied method for identifying longest rivers

In [19]:
longest_rivers_name = rivers_df.groupby('Name')['Length_mi'].agg('sum').sort_values(ascending=False).nlargest(10).index

longest_rivers_name
Out[19]:
Index(['Ohio River', 'Green River', 'Cumberland River', 'Licking River',
       'Kentucky River', 'North Fork Kentucky River', 'Salt River',
       'Rough River', 'Barren River', 'Levisa Fork'],
      dtype='object', name='Name')

Create a dataframe that only has the longest rivers in it (using: isin)

In [20]:
longest_rivers = rivers_df.loc[rivers_df['Name'].isin(longest_rivers_name)]
longest_rivers.head()
Out[20]:
Stream_ID Name Feature F_Code State State_FIPS Region ReachCode ComID From_node ... Strahler Length_mi GM_f_code GM_hyc GM_lit GM_nam GM_soc GM_exs GM_loc geometry
153494 10017306 Levisa Fork Artificial Path 23 KY 21 5 05070202005679 1089125 154617 ... 2 0.664634 BH140 8 2 Levisa Fork USA -999 -999 LINESTRING Z (-82.36220732199996 37.4202344110...
153504 10017316 Levisa Fork Artificial Path 23 KY 21 5 05070202005684 1089115 154618 ... 2 2.219310 BH140 8 2 Levisa Fork USA -999 -999 LINESTRING Z (-82.37007968199998 37.4270989370...
153509 10017321 Levisa Fork Artificial Path 23 KY 21 5 05070202005682 1089119 154633 ... 2 0.355789 BH140 8 2 Levisa Fork USA -999 -999 LINESTRING Z (-82.38742505099998 37.4292667500...
169135 10445672 Levisa Fork Stream 4 KY 21 5 05070202000307 1086719 170453 ... 2 0.849094 BH140 8 1 Levisa Fork USA -999 -999 LINESTRING Z (-82.42187417899999 37.4122737340...
169141 10445678 Levisa Fork Stream 4 KY 21 5 05070203000005 886737 170465 ... 4 1.572099 BH140 8 1 Levisa Fork USA -999 -999 LINESTRING Z (-82.66607515699997 37.9760865050...

5 rows × 22 columns

Plot the longest rivers over the rivers

In [21]:
fig, ax = plt.subplots()

base_color = '#f0f0f0'
border_color = base_color
line_color = '#1c68c0'
line_width = 0.7
longriver_color = "#f44262"

state_df.plot(ax=ax, edgecolor='gray', alpha=0.7, color=base_color)
rivers_df.plot(ax=ax, color=line_color, linewidth=line_width, zorder=1);
longest_rivers.plot(ax=ax, color=longriver_color, linewidth=line_width, zorder=2);
ax.set(xlim=(-90,-81.8), ylim=(36,39.5));

Load toxic data inventory

from data.world
point data drawn from the EPA's toxic release inventory program

In [22]:
tri = pd.read_csv('https://query.data.world/s/3b3oi57gti4qhoexmg74sdc3ftz2te', index_col='FID')
In [23]:
tri.head()
Out[23]:
X Y REGISTRY_I PRIMARY_NA LOCATION_A CITY_NAME COUNTY_NAM FIPS_CODE STATE_CODE POSTAL_COD ... UPDATE_DAT LAST_REPOR FAC_URL PGM_SYS_ID PGM_SYS_AC INTEREST_T ACTIVE_STA PUBLIC_IND PROGRAM_UR PGM_REPORT
FID
15001 -119.86581 34.43024 110002916836 LITTON GUIDANCE AND CONTROL SYSTEMS 6769 HOLLISTER AVENUE GOLETA SANTA BARBARA 06083 CA 93117-3001 ... 2014-07-31T00:00:00.000Z 2002-06-25T00:00:00.000Z http://oaspub.epa.gov/enviro/fac_gateway.main?... 93118LTTNS6769H TRIS TRI REPORTER OPEN Y http://www.epa.gov/tri/ no data yet
15002 -81.23533 41.75468 110000385921 UNIROYAL CHEMICAL CO INC 720 FAIRPORT NURSERY ROAD PAINESVILLE LAKE 39085 OH 44077-4462 ... 2014-07-31T00:00:00.000Z 2002-06-24T00:00:00.000Z http://oaspub.epa.gov/enviro/fac_gateway.main?... 44077NRYLC720FA TRIS TRI REPORTER OPEN Y http://www.epa.gov/tri/ no data yet
15003 -83.44870 41.66416 110000384334 AUTONEUM NORTH AMERICA INC 645 N. LALLENDORF RD. OREGON LUCAS 39095 OH 43616 ... 2014-07-31T00:00:00.000Z 2012-06-29T00:00:00.000Z http://oaspub.epa.gov/enviro/fac_gateway.main?... 43616GLBND645LA TRIS TRI REPORTER OPEN Y http://www.epa.gov/tri/ no data yet
15004 -117.38566 47.75197 110000491478 TRAVIS PATTERN 1413 EAST HAWTHORNE ROAD SPOKANE SPOKANE 53063 WA 99218-3100 ... 2014-07-31T00:00:00.000Z 2014-05-30T00:00:00.000Z http://oaspub.epa.gov/enviro/fac_gateway.main?... 99207TRVSPE1413 TRIS TRI REPORTER OPEN Y http://www.epa.gov/tri/ no data yet
15005 -77.40728 37.58766 110022532623 ENNIS PAINT-RICHMOND 4400 VAWTER AVE RICHMOND HENRICO 51087 VA 23222 ... 2014-07-31T00:00:00.000Z 2014-06-25T00:00:00.000Z http://oaspub.epa.gov/enviro/fac_gateway.main?... 23222DGLSC4400V TRIS TRI REPORTER OPEN Y http://www.epa.gov/tri/ no data yet

5 rows × 27 columns

In [24]:
tri.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 57630 entries, 15001 to 57000
Data columns (total 27 columns):
X             57630 non-null float64
Y             57630 non-null float64
REGISTRY_I    57630 non-null int64
PRIMARY_NA    57630 non-null object
LOCATION_A    57630 non-null object
CITY_NAME     57630 non-null object
COUNTY_NAM    57630 non-null object
FIPS_CODE     57630 non-null object
STATE_CODE    57630 non-null object
POSTAL_COD    57630 non-null object
LATITUDE83    57630 non-null float64
LONGITUDE8    57630 non-null float64
HUC8_CODE     57630 non-null object
ACCURACY_V    57630 non-null int64
COLLECT_MT    57630 non-null object
REF_POINT_    57630 non-null object
CREATE_DAT    57630 non-null object
UPDATE_DAT    56861 non-null object
LAST_REPOR    57488 non-null object
FAC_URL       57630 non-null object
PGM_SYS_ID    57630 non-null object
PGM_SYS_AC    57630 non-null object
INTEREST_T    57630 non-null object
ACTIVE_STA    57630 non-null object
PUBLIC_IND    57630 non-null object
PROGRAM_UR    57630 non-null object
PGM_REPORT    57630 non-null object
dtypes: float64(4), int64(2), object(21)
memory usage: 12.3+ MB
In [25]:
tri['HUC8_CODE'].unique()
Out[25]:
array(['18060013', '04110004', '04100010', ..., '10050004', '14080203',
       '17060203'], dtype=object)
In [26]:
from shapely.geometry import Point

convert pandas DataFrame into a GeoPandas GeoDataFrame

In [27]:
# create a Series of geometries using the point locations in the dataframe (tri)
geoms = [Point(xy) for xy in zip(tri.X, tri.Y)]
crs = {'init' :'epsg:4326'}
# use GeoPandas to create a GeoDataFrame and assign a CRS
tri_geo = gpd.GeoDataFrame(tri, crs=crs, geometry=geoms)
In [28]:
tri_geo.crs
Out[28]:
{'init': 'epsg:4326'}

plot the GeoPandas dataframe

In [29]:
fig, ax = plt.subplots()

base_color = '#f0f0f0'
border_color = base_color
line_color = '#1c68c0'
line_width = 0.7

us_states.plot(ax=ax, edgecolor='white', color='#f2f2f2', zorder=0);
rivers_raw.plot(ax=ax, color=line_color, linewidth=line_width, zorder=1);
tri_geo.plot(ax=ax, color='orange', zorder=2, markersize=.04);

ax.set(xlim=(-128,-65), ylim=(23,51))
plt.title('EPA Toxic Sites Plotted Over US Streams/Rivers', fontsize=20, color='#295b97');

Examine area of Kentucky

In [30]:
fig, ax = plt.subplots()

base_color = '#f0f0f0'
border_color = base_color
line_color = '#1c68c0'
line_width = 0.7
marker_size = 2

state_df.plot(ax=ax, edgecolor='gray', alpha=0.7, color=base_color)
rivers_df.plot(ax=ax, color=line_color, linewidth=line_width, zorder=1);
tri_geo.plot(ax=ax, color='orange', zorder=2, markersize=marker_size);

ax.set(xlim=(-90,-81.8), ylim=(36,39.5));  # limit the frame to the area of state geodataframe

Create a "clipping mask" using the outline of state in state geodataframe

the mask is used to eliminate points outside the state borders

check if the geom is not composed of both Polygon and Multipolygon types
a combination of both creates prolems for GeoPandas

In [31]:
state_df.geom_type.unique()
Out[31]:
array(['Polygon'], dtype=object)
In [32]:
rivers_df.geom_type.unique()
Out[32]:
array(['LineString'], dtype=object)

create a polygon to be used as a clipping mask

using the Shapely .unary_union method will return a single Shapely Polygon feature

In [33]:
state_poly = state_df.geometry.unary_union

intersect toxic inventory with state clipping mask

using the GeoPandas .intersects() method:

In [34]:
# create new GeoDataFrame of points that intersect with the clipping polygon
state_toxic_places = tri_geo[tri_geo.geometry.intersects(state_poly)]

Plot the new geo layers

In [35]:
fig, ax = plt.subplots()

base_color = '#f0f0f0'
border_color = base_color
line_color = '#1c68c0'
line_width = 0.7

state_df.plot(ax=ax, edgecolor='gray', alpha=0.7, color=base_color)
rivers_df.plot(ax=ax, color=line_color, linewidth=line_width, zorder=1);
state_toxic_places.plot(ax=ax, color='orange', zorder=2, markersize=2);
ax.set(xlim=(-90,-81.8), ylim=(36,39.5));

Prepare the data to measure distance

goal is to identify stream possibly impacted my toxic places
To solve this, perform some coordinate operations on the datasets and project them to an equidistant conic projection. Use a modified Proj4 string adjusted to center the meridian on the US and minimize distortion between two standard parallels:

In [36]:
state_eqdc = state_df.to_crs('+proj=eqdc +lat_0=39 +lon_0=-96 +lat_1=33 +lat_2=45 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs')
tri_eqdc = state_toxic_places.to_crs('+proj=eqdc +lat_0=39 +lon_0=-96 +lat_1=33 +lat_2=45 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs')
rivers_eqdc = rivers_df.to_crs('+proj=eqdc +lat_0=39 +lon_0=-96 +lat_1=33 +lat_2=45 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs ')

check the plot of the reprojected dataframes

In [37]:
fig, ax = plt.subplots()

base_color = '#f0f0f0'
border_color = base_color
line_color = '#1c68c0'
line_width = 0.7

state_eqdc.plot(ax=ax, edgecolor='gray', alpha=0.7, color=base_color)
rivers_eqdc.plot(ax=ax, color=line_color, linewidth=line_width, zorder=1);
tri_eqdc.plot(ax=ax, color='orange', zorder=2, markersize=2);
In [38]:
fig, ax = plt.subplots()

base_color = '#f0f0f0'
border_color = base_color
line_color = '#1c68c0'
line_width = 2

state_eqdc.plot(ax=ax, edgecolor='gray', alpha=0.7, color=base_color)
rivers_eqdc.plot(ax=ax, color=line_color, linewidth=line_width, zorder=1);
tri_eqdc.plot(ax=ax, color='orange', zorder=2, alpha=0.8, markersize=.5);
ax.set(xlim=(790000,750000), ylim=(-210000,-190000));

Use GeoPandas .buffer() method to create a new column in the existing GeoDataFrame

create a new column and assign it the result of the buffer operation

the result will be in meters

In [39]:
tri_eqdc['buffer'] = tri_eqdc.buffer(500)
# note they are now polygons in meters
tri_eqdc['buffer'].head()
Out[39]:
FID
15120    POLYGON ((893863.883211808 -40753.68460501776,...
15209    POLYGON ((910176.5330722714 -25534.61646341437...
15214    POLYGON ((855667.7649079863 -173274.0803310149...
15413    POLYGON ((920363.1437152165 -76393.76624178597...
15425    POLYGON ((891782.4351570779 -35457.18508778657...
Name: buffer, dtype: object

make the new buffer column the active geometry to plot the data

In [40]:
print('The current geometry for plots: ' + tri_eqdc.geometry.name)
# set the buffer col as the active geometry
tri_eqdc = tri_eqdc.set_geometry('buffer')
# now the geometry is the buffer column
print('The new geometry for plots: ' + tri_eqdc.geometry.name)
The current geometry for plots: geometry
The new geometry for plots: buffer

Replot with 500 meter buffer circles

In [41]:
fig, ax = plt.subplots()

base_color = '#f0f0f0'
border_color = base_color
line_color = '#1c68c0'
line_width = 0.7
marker_size = 1

state_eqdc.plot(ax=ax, edgecolor='gray', alpha=0.7, color=base_color)
rivers_eqdc.plot(ax=ax, color=line_color, linewidth=line_width, zorder=1);
tri_eqdc.plot(ax=ax, color='orange', zorder=2, markersize=marker_size);
In [42]:
fig, ax = plt.subplots()

base_color = '#f0f0f0'
border_color = base_color
line_color = '#1c68c0'
line_width = 0.7
marker_size = 0.08

state_eqdc.plot(ax=ax, edgecolor='gray', alpha=0.7, color=base_color)
rivers_eqdc.plot(ax=ax, color=line_color, linewidth=line_width, zorder=1);
tri_eqdc.plot(ax=ax, color='orange', zorder=2, alpha=0.4, markersize=marker_size);
ax.set(xlim=(790000,750000), ylim=(-210000,-190000));

Create Line Polygon Intersection

gpd.sjoin(line_gdf, poly_gdf, op='intersects')

In [43]:
tri_stream_df = gpd.sjoin(tri_eqdc, rivers_eqdc, op='intersects')
/Users/mark/anaconda3/envs/sandbox/lib/python3.7/site-packages/geopandas/tools/sjoin.py:56: UserWarning: CRS of frames being joined does not match!(+proj=eqdc +lat_0=39 +lon_0=-96 +lat_1=33 +lat_2=45 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs != +proj=eqdc +lat_0=39 +lon_0=-96 +lat_1=33 +lat_2=45 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs )
  '(%s != %s)' % (left_df.crs, right_df.crs))
/Users/mark/anaconda3/envs/sandbox/lib/python3.7/site-packages/numpy/lib/function_base.py:2167: RuntimeWarning: invalid value encountered in ? (vectorized)
  outputs = ufunc(*inputs)
In [44]:
tri_stream_df.crs
Out[44]:
'+proj=eqdc +lat_0=39 +lon_0=-96 +lat_1=33 +lat_2=45 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs'
In [45]:
tri_stream_df.head()
Out[45]:
X Y REGISTRY_I PRIMARY_NA LOCATION_A CITY_NAME COUNTY_NAM FIPS_CODE STATE_CODE POSTAL_COD ... Enabled Strahler Length_mi GM_f_code GM_hyc GM_lit GM_nam GM_soc GM_exs GM_loc
5035 -88.39506 37.05202 110000380043 GERDAU AMERISTEEL US INC 1035 SHAR-CAL RD. CALVERT CITY MARSHALL 21157 KY 42029 ... 1 5 1.991328 BH140 8 1 Tennessee River USA -999 -999
5035 -88.39506 37.05202 110000380043 GERDAU AMERISTEEL US INC 1035 SHAR-CAL RD. CALVERT CITY MARSHALL 21157 KY 42029 ... 1 5 0.363205 BH140 8 1 Tennessee River USA -999 -999
5035 -88.39506 37.05202 110000380043 GERDAU AMERISTEEL US INC 1035 SHAR-CAL RD. CALVERT CITY MARSHALL 21157 KY 42029 ... 1 5 1.667725 BH140 8 1 Tennessee River USA -999 -999
5230 -85.92649 36.97383 110003218704 JL FRENCH GLASGOW PLANT #1 20 PRESTWICK DRIVE GLASGOW BARREN 21009 KY 42141-8254 ... 1 1 3.903664 BH140 8 1 South Fork Beaver Creek USA -999 -999
38705 -85.93332 36.97622 110000380515 FEDERAL-MOGUL VSP 20 ABERDEEN ROAD GLASGOW BARREN 21009 KY 42141-8238 ... 1 1 3.903664 BH140 8 1 South Fork Beaver Creek USA -999 -999

5 rows × 51 columns

In [46]:
streamID = tri_stream_df['Stream_ID'].astype(str)
streamFromNode = tri_stream_df['From_node'].astype(str)
streamToNode = tri_stream_df['To_node'].astype(str)
combinecols = streamID + streamFromNode + streamToNode

tri_stream_df['master_ID']=combinecols
tri_stream_df['master_ID'].head()
Out[46]:
5035     10261902200327200328
5035     10263382201662200327
5035     10263352201469201662
5230     10454104186054186055
38705    10454104186054186055
Name: master_ID, dtype: object
In [47]:
fig, ax = plt.subplots()

base_color = '#f0f0f0'
border_color = base_color
line_color = '#1c68c0'
line_width = 0.9
marker_size = 4

state_eqdc.plot(ax=ax, edgecolor='gray', alpha=0.7, color=base_color)
rivers_eqdc.plot(ax=ax, color=line_color, linewidth=line_width, zorder=1);
tri_stream_df.plot(ax=ax, color='orange', zorder=2, markersize=marker_size);

Closeup look at buffer poly - line intersections

In [48]:
fig, ax = plt.subplots()

base_color = '#f0f0f0'
border_color = base_color
line_color = '#1c68c0'
line_width = 0.9
marker_size = 0.008

state_eqdc.plot(ax=ax, edgecolor='gray', alpha=0.7, color=base_color)
rivers_eqdc.plot(ax=ax, color=line_color, linewidth=line_width, zorder=1);
tri_stream_df.plot(ax=ax, color='orange', zorder=2, alpha=0.8, markersize=marker_size);

ax.set(xlim=(790000,750000), ylim=(-210000,-190000));

stats before the intersect and after

In [49]:
print('\nBefore the intersect the dataframe tri_eqdc had', tri_eqdc.shape[0], 'rows and', tri_eqdc.shape[1], 'columns\n')
print('The new intersected tri_streams_df has', tri_stream_df.shape[0], 'rows and', tri_stream_df.shape[1], 'columns' )
Before the intersect the dataframe tri_eqdc had 1001 rows and 29 columns

The new intersected tri_streams_df has 185 rows and 52 columns
In [50]:
print(len(tri_stream_df['Name'].unique()), 'unique stream/rivers names have an intersection with a toxic inventory site')
63 unique stream/rivers names have an intersection with a toxic inventory site

build a list of unique streams that intersect with the buffer

In [51]:
tri_streams_IndexValList = tri_stream_df.index
tri_streams_NameList = tri_stream_df['Name'].tolist()
tri_streams_FromNodeList = tri_stream_df['From_node'].tolist()
tri_streams_ToNodeList = tri_stream_df['To_node'].tolist()
tri_streams_Stream_ID = tri_stream_df['Stream_ID'].tolist()
In [52]:
# list of lists (tuple)
tri_streams=[]
for index in range(len(tri_streams_IndexValList)):
    tri_streams.append([tri_streams_Stream_ID[index], tri_streams_FromNodeList[index], 
                        tri_streams_ToNodeList[index]])
In [53]:
# One list
tri_streams=[]
for index in range(len(tri_streams_IndexValList)):
    master_ID = str(tri_streams_Stream_ID[index]) + str(tri_streams_FromNodeList[index]) +  str(tri_streams_ToNodeList[index])
    tri_streams.append(master_ID)
In [54]:
# works for list of lists (tuple)
#tri_streams_unique=[]
#tri_streams_unique = [list(x) for x in set(tuple(x) for x in tri_streams)]
In [55]:
tri_streams_unique=[]
tri_streams_unique = [list(x) for x in set(x for x in tri_streams)]
In [56]:
print('The list count with duplicates: ', len(tri_streams))
print('The list count without duplicates: ',len((tri_streams_unique)))
The list count with duplicates:  185
The list count without duplicates:  154

Add a master_ID column to the rivers_eqdc

In [57]:
cols = ['Stream_ID', 'From_node', 'To_node']
rivers_eqdc[cols].head()
Out[57]:
Stream_ID From_node To_node
153476 10017288 154565 154605
153494 10017306 154617 154618
153503 10017315 154631 154632
153504 10017316 154618 154633
153509 10017321 154633 154637
In [58]:
streamID = rivers_eqdc['Stream_ID'].astype(str)
streamFromNode = rivers_eqdc['From_node'].astype(str)
streamToNode = rivers_eqdc['To_node'].astype(str)
combinecols = streamID + streamFromNode + streamToNode

rivers_eqdc['master_ID']=combinecols
rivers_eqdc['master_ID'].head()
Out[58]:
153476    10017288154565154605
153494    10017306154617154618
153503    10017315154631154632
153504    10017316154618154633
153509    10017321154633154637
Name: master_ID, dtype: object

Create dataframe of river sections that intersected with points

In [59]:
toxic_rivers = rivers_eqdc.loc[rivers_eqdc['master_ID'].isin(tri_streams)]
cols = ['Name','Stream_ID', 'From_node', 'To_node', 'master_ID']
toxic_rivers[cols].head()
Out[59]:
Name Stream_ID From_node To_node master_ID
171401 Levisa Fork 10445692 172465 172466 10445692172465172466
171628 None 10445920 172808 172862 10445920172808172862
171800 Levisa Fork 10446092 172466 173102 10446092172466173102
171801 Levisa Fork 10446093 173103 172855 10446093173103172855
171819 Johns Creek 10446111 173124 173125 10446111173124173125

Plot the identified stream portions that intersect with the buffer

In [60]:
fig, ax = plt.subplots()

base_color = '#f0f0f0'
border_color = base_color
line_color = '#1c68c0'
line_width = 0.9
marker_size = 0.008

state_eqdc.plot(ax=ax, edgecolor='gray', alpha=0.7, color=base_color)
rivers_eqdc.plot(ax=ax, color=line_color, linewidth=line_width, zorder=1);
tri_stream_df.plot(ax=ax, color='orange', zorder=2, alpha=0.8, markersize=marker_size);
toxic_rivers.plot(ax=ax, color='red', linewidth=line_width, zorder=3);

ax.set(xlim=(750000,1090000), ylim=(-110500,55000));

Design a function to streamline processing

This function takes a state and highlights its' longest rivers in a plot

In [61]:
def one_state(state_val):
    state_df = us_states.loc[us_states['STUSPS'] == state_val]
    rivers_df = rivers_raw.loc[rivers_raw['State'].str.contains(state_val)]
    longest_rivers_name = rivers_df.groupby('Name')['Length_mi'].agg('sum').sort_values(ascending=False).nlargest(10).index
    print(longest_rivers_name)
    longest_rivers_df = rivers_df.loc[rivers_df['Name'].isin(longest_rivers_name)]
    fig, ax = plt.subplots()

    base_color = '#f0f0f0'
    border_color = base_color
    line_color = '#1c68c0'
    line_width = 0.7
    longriver_color = "#f44262"

    state_df.plot(ax=ax, edgecolor='gray', alpha=0.7, color=base_color)
    rivers_df.plot(ax=ax, color=line_color, linewidth=line_width, zorder=1);
    longest_rivers_df.plot(ax=ax, color=longriver_color, linewidth=line_width, zorder=2);
    #ax.set(xlim=(-90,-81.8), ylim=(36,39.5));
    #return state_df, rivers_df, longest_rivers_df
In [62]:
xyz = one_state('UT')
Index(['Green River', 'Colorado River', 'Sevier River', 'Cottonwood Creek',
       'Willow Creek', 'Bear River', 'Muddy Creek', 'San Juan River',
       'Price River', 'Beaver River'],
      dtype='object', name='Name')
In [ ]: